1 Introduction

Passenger satisfaction is a crucial metric in the aviation industry because it directly affects customer loyalty and retention. This project uses airline satisfaction data from Kaggle to carry out exploratory data analysis and gain a better understanding of passenger preferences in airline services. Several machine learning classification algorithms are then applied to predict passenger satisfaction. With the insights and conclusions from this analysis, we seek to identify how airline services can be improved to increase passenger satisfaction.

The dataset can be found here: Airline Passenger Satisfaction Dataset

2 Objectives

  1. To determine the main airline services that affect passenger satisfaction.
  2. To understand the correlation between the services provided by the airline.
  3. To compare the performance of different classification models and identify the best model in terms of accuracy.

3 Methodology

3.1 Data Understanding

Understanding the data that is available for mining is critical for avoiding unexpected problems during data preparation. This phase involves collecting the data, describing and exploring it, and finally verifying its quality. Each step is detailed below.

3.1.1 Libraries

# load library
library(ggcorrplot)
library(lattice)
library(stringr)
library(dplyr)
library(data.table)
library(ggplot2)
library(ggpubr)
library(tidyr)
library(ggrepel)
library(GGally)
library(vip)
library(patchwork)
library(magrittr)
library(caret)
library(devtools)
library(caTools)
library(party)
library(e1071)
library(class)

3.1.2 Airline Passenger Satisfaction Dataset

For this project, the original dataset was deliberately made unclean in Microsoft Excel by modifying and removing some values and by replacing others with inaccurate data. This modified dataset is used to demonstrate data cleaning later.

airline_df <- read.csv("Airline Passenger Satisfaction Dataset/train_unclean.csv"
                       , header = TRUE
                       , stringsAsFactors = TRUE)

Attributes and Dimension of the dataset

colnames(airline_df)
##  [1] "X"                                 "id"                               
##  [3] "Gender"                            "Customer.Type"                    
##  [5] "Age"                               "Type.of.Travel"                   
##  [7] "Class"                             "Flight.Distance"                  
##  [9] "Inflight.wifi.service"             "Departure.Arrival.time.convenient"
## [11] "Ease.of.Online.booking"            "Gate.location"                    
## [13] "Food.and.drink"                    "Online.boarding"                  
## [15] "Seat.comfort"                      "Inflight.entertainment"           
## [17] "On.board.service"                  "Leg.room.service"                 
## [19] "Baggage.handling"                  "Checkin.service"                  
## [21] "Inflight.service"                  "Cleanliness"                      
## [23] "Departure.Delay.in.Minutes"        "Arrival.Delay.in.Minutes"         
## [25] "satisfaction"
dim(airline_df)
## [1] 103904     25

The dataset contains 103,904 entries and 25 columns: a row index (X), a passenger identifier (id), the target variable (satisfaction), and 22 attributes. The attributes cover passenger demographics and trip details such as gender, age, customer type, travel type, class and flight distance; ratings of the flight experience such as in-flight Wi-Fi service, ease of online booking, gate location, food and drink, seat comfort, in-flight entertainment, on-board service, legroom, baggage handling, check-in service and cleanliness; and departure and arrival delay times.

First 6 rows in the dataset

head(airline_df)
##   X     id Gender     Customer.Type Age  Type.of.Travel    Class
## 1 0  70172   Male    Loyal customer  13  Personaltravel Eco Plus
## 2 1   5047   Male disloyal Customer  25  Businesstravel Business
## 3 2 110028 femALe    Loyal Customer  26 Business travel Business
## 4 3  24026 Female    Loyal Customer  25 Business travel Business
## 5 4 119299   Male    Loyal Customer  61 Business travel Business
## 6 5 111157 Female    Loyal Customer  26 Personal Travel     Ekon
##   Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient
## 1             460                   3.0                                 4
## 2             235                   3.5                                 2
## 3            1142                   2.0                                 2
## 4             562                   2.0                                 5
## 5             214                   3.0                                 3
## 6            1180                   3.5                                 4
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             1            5.0               3
## 2                      3             3            1.0               3
## 3                      2             2            5.0               5
## 4                      5             5            2.0               2
## 5                      3             3            4.0               5
## 6                      2             1            1.2               2
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            5                      5                4                3
## 2            1                      1                1                5
## 3            7                      5                4                3
## 4            8                      2                2                5
## 5            9                      3                3                4
## 6           10                      1                3                4
##   Baggage.handling Checkin.service Inflight.service Cleanliness
## 1                4               4                5           5
## 2                3               1                4           1
## 3                4               4                4           5
## 4                3               1                4           2
## 5                4               3                3           3
## 6                4               4                4           1
##   Departure.Delay.in.Minutes Arrival.Delay.in.Minutes            satisfaction
## 1                         25                       18 neutral or dissatisfied
## 2                          1                        6 neutral or dissatisfied
## 3                          0                        0               satisfied
## 4                         11                        9 neutral or dissatisfied
## 5                          0                        0               satisfied
## 6                          0                        0 neutral or dissatisfied

Data Types for each attribute in the dataset

sapply(airline_df, class)
##                                 X                                id 
##                         "integer"                         "integer" 
##                            Gender                     Customer.Type 
##                          "factor"                          "factor" 
##                               Age                    Type.of.Travel 
##                         "integer"                          "factor" 
##                             Class                   Flight.Distance 
##                          "factor"                         "integer" 
##             Inflight.wifi.service Departure.Arrival.time.convenient 
##                         "numeric"                         "integer" 
##            Ease.of.Online.booking                     Gate.location 
##                         "integer"                         "integer" 
##                    Food.and.drink                   Online.boarding 
##                         "numeric"                         "integer" 
##                      Seat.comfort            Inflight.entertainment 
##                         "integer"                         "integer" 
##                  On.board.service                  Leg.room.service 
##                         "integer"                         "integer" 
##                  Baggage.handling                   Checkin.service 
##                         "integer"                         "integer" 
##                  Inflight.service                       Cleanliness 
##                         "integer"                         "integer" 
##        Departure.Delay.in.Minutes          Arrival.Delay.in.Minutes 
##                         "integer"                         "integer" 
##                      satisfaction 
##                          "factor"

The data consists of categorical (nominal and ordinal) and numerical (continuous) variables that describe passenger behavior and satisfaction levels.

Summary of Dataset

summary(airline_df)
##        X                id            Gender                Customer.Type  
##  Min.   :     0   Min.   :     1   femALe:   28   disloyal Customer:18981  
##  1st Qu.: 25976   1st Qu.: 32534   Female:52699   Loyal customer   :   37  
##  Median : 51952   Median : 64857   mAlE  :   49   Loyal Customer   :84886  
##  Mean   : 51952   Mean   : 64924   Male  :51128                            
##  3rd Qu.: 77927   3rd Qu.: 97368                                           
##  Max.   :103903   Max.   :129880                                           
##                                                                            
##       Age                Type.of.Travel       Class       Flight.Distance
##  Min.   : 7.00   Business travel:71622   Bisnes  :   34   Min.   :  31   
##  1st Qu.:27.00   Businesstravel :   33   Business:49631   1st Qu.: 414   
##  Median :40.00   Personal Travel:32176   Eco     :20315   Median : 843   
##  Mean   :39.38   Personaltravel :   73   Eco Plus: 7494   Mean   :1189   
##  3rd Qu.:51.00                           Economy :26388   3rd Qu.:1743   
##  Max.   :85.00                           Ekon    :   42   Max.   :4983   
##                                                                          
##  Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking
##  Min.   :0.000         Min.   :0.00                      Min.   :0.000         
##  1st Qu.:2.000         1st Qu.:2.00                      1st Qu.:2.000         
##  Median :3.000         Median :3.00                      Median :3.000         
##  Mean   :2.731         Mean   :3.06                      Mean   :2.757         
##  3rd Qu.:4.000         3rd Qu.:4.00                      3rd Qu.:4.000         
##  Max.   :8.000         Max.   :5.00                      Max.   :5.000         
##                                                                                
##  Gate.location   Food.and.drink  Online.boarding  Seat.comfort   
##  Min.   :0.000   Min.   :0.000   Min.   :0.00    Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00    1st Qu.: 2.000  
##  Median :3.000   Median :3.000   Median :3.00    Median : 4.000  
##  Mean   :2.977   Mean   :3.203   Mean   :3.25    Mean   : 3.443  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00    3rd Qu.: 5.000  
##  Max.   :5.000   Max.   :5.500   Max.   :5.00    Max.   :20.000  
##                                                                  
##  Inflight.entertainment On.board.service Leg.room.service Baggage.handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000   
##  1st Qu.:2.000          1st Qu.:2.000    1st Qu.:2.000    1st Qu.:3.000   
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000   
##  Mean   :3.358          Mean   :3.382    Mean   :3.351    Mean   :3.632   
##  3rd Qu.:4.000          3rd Qu.:4.000    3rd Qu.:4.000    3rd Qu.:5.000   
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000   
##                                                                           
##  Checkin.service Inflight.service  Cleanliness    Departure.Delay.in.Minutes
##  Min.   :0.000   Min.   :0.00     Min.   :0.000   Min.   :   0.00           
##  1st Qu.:3.000   1st Qu.:3.00     1st Qu.:2.000   1st Qu.:   0.00           
##  Median :3.000   Median :4.00     Median :3.000   Median :   0.00           
##  Mean   :3.304   Mean   :3.64     Mean   :3.286   Mean   :  14.83           
##  3rd Qu.:4.000   3rd Qu.:5.00     3rd Qu.:4.000   3rd Qu.:  12.00           
##  Max.   :5.000   Max.   :5.00     Max.   :5.000   Max.   :1592.00           
##                                                   NA's   :92                
##  Arrival.Delay.in.Minutes                  satisfaction  
##  Min.   :   0.00          neutral or dissatisfied:58879  
##  1st Qu.:   0.00          satisfied              :45025  
##  Median :   0.00                                         
##  Mean   :  15.18                                         
##  3rd Qu.:  13.00                                         
##  Max.   :1584.00                                         
##  NA's   :310

3.1.3 Data Checking

The purpose of data checking is to ensure that the data gathered is accurate, consistent, and complete to perform reliable analysis in the later stage.
Check duplicate data using the id attribute (unique identifier)

sum(duplicated(airline_df$id))
## [1] 0

Check missing values in each attribute

colSums(is.na(airline_df))
##                                 X                                id 
##                                 0                                 0 
##                            Gender                     Customer.Type 
##                                 0                                 0 
##                               Age                    Type.of.Travel 
##                                 0                                 0 
##                             Class                   Flight.Distance 
##                                 0                                 0 
##             Inflight.wifi.service Departure.Arrival.time.convenient 
##                                 0                                 0 
##            Ease.of.Online.booking                     Gate.location 
##                                 0                                 0 
##                    Food.and.drink                   Online.boarding 
##                                 0                                 0 
##                      Seat.comfort            Inflight.entertainment 
##                                 0                                 0 
##                  On.board.service                  Leg.room.service 
##                                 0                                 0 
##                  Baggage.handling                   Checkin.service 
##                                 0                                 0 
##                  Inflight.service                       Cleanliness 
##                                 0                                 0 
##        Departure.Delay.in.Minutes          Arrival.Delay.in.Minutes 
##                                92                               310 
##                      satisfaction 
##                                 0

Check distinct values in each nominal attribute

levels(airline_df$Gender)
## [1] "femALe" "Female" "mAlE"   "Male"
levels(airline_df$Customer.Type)
## [1] "disloyal Customer" "Loyal customer"    "Loyal Customer"
levels(airline_df$Type.of.Travel)
## [1] "Business travel" "Businesstravel"  "Personal Travel" "Personaltravel"
levels(airline_df$Class)
## [1] "Bisnes"   "Business" "Eco"      "Eco Plus" "Economy"  "Ekon"
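
Spelling variants like the ones listed above can also be surfaced programmatically. The following is a minimal sketch, assuming a hypothetical helper `find_variants()` that is not part of the original analysis. It groups factor levels by a key that ignores case and whitespace, so it catches variants such as "femALe" or "Businesstravel", but not genuine misspellings such as "Bisnes" or "Ekon".

```r
# Hypothetical helper: group factor levels that differ only by case or spacing.
find_variants <- function(x) {
  lv  <- levels(as.factor(x))
  key <- tolower(gsub("\\s+", "", lv))   # normalise case and remove whitespace
  groups <- split(lv, key)
  groups[lengths(groups) > 1]            # keep only keys with several spellings
}

# Toy input mirroring the Gender and Type.of.Travel levels shown above
gender <- factor(c("femALe", "Female", "mAlE", "Male"))
travel <- factor(c("Business travel", "Businesstravel",
                   "Personal Travel", "Personaltravel"))

gender_variants <- find_variants(gender)
travel_variants <- find_variants(travel)
print(gender_variants)
print(travel_variants)
```

Each element of the returned list holds the spellings that collapse to the same key, which gives a quick checklist of levels to merge during cleaning.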

3.2 Data Cleaning

Clean data improves the quality of the information used for decision-making and makes the later analysis more reliable. In the continuous data, 92 and 310 missing values were detected in Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes respectively; these are replaced with 0. Incorrect values, misspellings and structural errors were detected in the nominal attributes Gender, Customer.Type, Type.of.Travel and Class. Out-of-range values were found in some ordinal attributes and are fixed by imputation. All noisy data are cleaned, formatted and renamed accordingly, and irrelevant attributes are dropped to obtain the final clean dataset.

3.2.1 Nominal Data

Gender
Convert Gender into proper case

airline_df$Gender.C <-
  as.factor(str_to_title(airline_df$Gender))
levels(airline_df$Gender.C)
## [1] "Female" "Male"

Customer Type
Convert Customer Type into proper case

airline_df$Customer.Type.C <- 
  as.factor(str_to_title(airline_df$Customer.Type))
levels(airline_df$Customer.Type.C)
## [1] "Disloyal Customer" "Loyal Customer"

Type of Travel
Correct Type of Travel categories

airline_df$Type.of.Travel.C <- 
  ifelse(grepl('^b', airline_df$Type.of.Travel, ignore.case = TRUE), 
         "Business Travel",
         "Personal Travel")
airline_df$Type.of.Travel.C <- as.factor(airline_df$Type.of.Travel.C)
levels(airline_df$Type.of.Travel.C)
## [1] "Business Travel" "Personal Travel"

Class
Correct Class categories

airline_df$Class.C <- 
  ifelse(grepl('^b', airline_df$Class, ignore.case = TRUE), 
         "Business", "Eco")
airline_df$Class.C <- as.factor(airline_df$Class.C)
levels(airline_df$Class.C)
## [1] "Business" "Eco"
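
Note that the rule above keeps only two classes: every level starting with "b" becomes Business and every other level, including Eco Plus, becomes Eco. As a small illustration, the sketch below applies the same rule to the six raw levels reported earlier; the vector and data frame names are introduced here for illustration only.

```r
# The six raw Class levels reported by levels(airline_df$Class)
raw_class <- c("Bisnes", "Business", "Eco", "Eco Plus", "Economy", "Ekon")

# Same rule as above: anything starting with "b" (any case) is Business
clean_class <- ifelse(grepl("^b", raw_class, ignore.case = TRUE),
                      "Business", "Eco")

# Side-by-side view of the raw level and the cleaned category
class_map <- data.frame(raw = raw_class, clean = clean_class)
print(class_map)
```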

3.2.2 Ordinal Data

All rating attributes are treated as ordinal data.
Inflight Wifi Service
Replace ratings above 5 with NA (missing values), then impute the missing values with the mean of the non-zero ratings

airline_df$Inflight.wifi.service.C <- airline_df$Inflight.wifi.service
airline_df$Inflight.wifi.service.C[airline_df$Inflight.wifi.service.C > 5] <- NA
airline_df$Inflight.wifi.service.C[is.na(airline_df$Inflight.wifi.service.C)] <-
  mean(airline_df$Inflight.wifi.service.C
       [airline_df$Inflight.wifi.service.C > 0]
       , na.rm = TRUE
       )

Convert to integer

airline_df$Inflight.wifi.service.C <- 
  round(airline_df$Inflight.wifi.service.C)

Food and Drink
Replace ratings above 5 with NA (missing values), then impute the missing values with the mean of the non-zero ratings

airline_df$Food.and.drink.C <- airline_df$Food.and.drink
airline_df$Food.and.drink.C[airline_df$Food.and.drink.C > 5] <- NA
airline_df$Food.and.drink.C[is.na(airline_df$Food.and.drink.C)] <-
  mean(airline_df$Food.and.drink.C
       [airline_df$Food.and.drink.C > 0]
       , na.rm = TRUE
       )

Convert to integer

airline_df$Food.and.drink.C <- 
  round(airline_df$Food.and.drink.C)

Seat Comfort
Replace ratings above 5 with NA (missing values), then impute the missing values with the mean of the non-zero ratings

airline_df$Seat.comfort.C <- airline_df$Seat.comfort
airline_df$Seat.comfort.C[airline_df$Seat.comfort.C > 5] <- NA
airline_df$Seat.comfort.C[is.na(airline_df$Seat.comfort.C)] <-
  mean(airline_df$Seat.comfort.C
       [airline_df$Seat.comfort.C > 0]
       , na.rm = TRUE
       )

Convert to integer

# convert to integer
airline_df$Seat.comfort.C <- 
  round(airline_df$Seat.comfort.C)
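
The three rating columns above go through the same steps, so the logic could be collected in one function. The following is a minimal sketch, assuming a hypothetical helper `clean_rating()` that is not part of the original code; it is shown on a toy vector rather than on `airline_df`.

```r
# Hypothetical helper (not in the original code): cap, impute and round a rating.
clean_rating <- function(x, max_rating = 5) {
  x[!is.na(x) & x > max_rating] <- NA     # out-of-range ratings become NA
  fill <- mean(x[!is.na(x) & x > 0])      # mean of the valid non-zero ratings
  x[is.na(x)] <- fill                     # impute the missing values
  round(x)                                # back to whole-number ratings
}

# Toy ratings with two out-of-range values (8 and 20) and one fractional value
toy <- c(3, 3.5, 2, 8, 0, 5, 20)
cleaned <- clean_rating(toy)
print(cleaned)
```

One detail to keep in mind is that R's `round()` rounds halves to the even digit, so a rating of 3.5 becomes 4 while 2.5 would become 2.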

All attributes with ratings
Convert all rating attributes to factors with levels 0 to 5

ratings <- c("Inflight.wifi.service.C", "Food.and.drink.C", "Seat.comfort.C",
          "Departure.Arrival.time.convenient",  "Ease.of.Online.booking",
          "Gate.location", "Online.boarding", "Inflight.entertainment",
          "On.board.service",   "Leg.room.service", "Baggage.handling",
          "Checkin.service",    "Inflight.service", "Cleanliness"
          )
airline_df[ratings] <- lapply(airline_df[ratings], 
                              factor, 
                              levels=c(0, 1, 2, 3, 4, 5)
                              )
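
Called this way, `factor()` produces unordered factors, so the ratings are stored as categories without an explicit ranking. If the ranking is needed, for example for comparisons between ratings, an ordered factor is one option. The sketch below shows the difference on a toy vector; the ordered variant is an alternative for illustration, not what the original code does.

```r
ratings_raw <- c(0, 3, 5, 4, 1)

# As in the code above: an unordered factor with levels 0..5
unordered_rating <- factor(ratings_raw, levels = 0:5)

# Alternative (assumption, not used in the original): an ordered factor
ordered_rating <- factor(ratings_raw, levels = 0:5, ordered = TRUE)

print(is.ordered(unordered_rating))
print(is.ordered(ordered_rating))

# Comparisons against a level are defined only for ordered factors
high <- ordered_rating >= "4"
print(high)
```

Ordered factors are encoded with polynomial contrasts by default in R model formulas, so switching to them would change how the ratings enter a regression model.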

3.2.3 Continuous Data

Departure & Arrival Delay in minutes
Replace NA with 0

delay_cols <- c("Departure.Delay.in.Minutes", "Arrival.Delay.in.Minutes")
airline_df[delay_cols][is.na(airline_df[delay_cols])] <- 0
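
Replacing a missing delay with 0 treats the flight as on time, which is in line with the median of 0 for both delay columns in the summary above. The indexing idiom can be checked on a small example; the sketch below uses a toy data frame invented for illustration and confirms that no missing values remain.

```r
# Toy data frame with one missing value per delay column
toy <- data.frame(
  Departure.Delay.in.Minutes = c(25, NA, 0),
  Arrival.Delay.in.Minutes   = c(18, 6, NA)
)
delay_cols <- c("Departure.Delay.in.Minutes", "Arrival.Delay.in.Minutes")

# is.na() on a data frame returns a logical matrix, which can index the frame
toy[delay_cols][is.na(toy[delay_cols])] <- 0

# Count the missing values left in each column
remaining_na <- colSums(is.na(toy))
print(toy)
print(remaining_na)
```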

3.2.4 Tidy Up

# rearrange columns
col_order <- c("X", "id", "Gender", "Gender.C", "Customer.Type",
               "Customer.Type.C", "Age", "Type.of.Travel", "Type.of.Travel.C",
               "Class", "Class.C","Flight.Distance", "Inflight.wifi.service", 
               "Inflight.wifi.service.C", "Departure.Arrival.time.convenient",  
               "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
               "Food.and.drink.C", "Online.boarding", "Seat.comfort", 
               "Seat.comfort.C", "Inflight.entertainment", "On.board.service",  
               "Leg.room.service", "Baggage.handling", "Checkin.service",   
               "Inflight.service",  "Cleanliness", "Departure.Delay.in.Minutes",
               "Arrival.Delay.in.Minutes", "satisfaction"
               )
airline_df <- airline_df[, col_order]

# drop columns
drop_col <- c("Gender", "Customer.Type", "Type.of.Travel", "Class",
              "Inflight.wifi.service", "Food.and.drink", "Seat.comfort")
airline_df[, drop_col] <- list(NULL)

# rename columns
old_colnames <- c("Gender.C", "Customer.Type.C", "Type.of.Travel.C", "Class.C", 
                  "Inflight.wifi.service.C", "Food.and.drink.C", "Seat.comfort.C"
                  )
new_colnames <- drop_col
setnames(airline_df, old = old_colnames, new = new_colnames)

# drop "X" column
airline_df[, "X"] <- list(NULL)

3.3 Exploratory Data Analysis (EDA)

3.3.1 Overview

In this section, we will walk through an overall view of the airline satisfaction dataset by customer demographics and service ratings.

3.3.1.1 Customers

The graphs below give an overview of passenger demographics and characteristics, which will be used in further analysis.

Customer Gender

As seen in the pie chart, the gender distribution among passengers is relatively balanced, with 50.7% of the passengers being female and 49.3% being male. The small difference of about 1.4 percentage points indicates that the data is not skewed towards one gender, allowing for a fair representation of both male and female passengers.
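
The gender shares can be recomputed from the counts in the `summary(airline_df)` output shown earlier. The sketch below is a small check under one assumption: the mis-cased levels ("femALe", "mAlE") are merged by hand, as the cleaning step does.

```r
# Counts taken from summary(airline_df); mis-cased levels are merged by hand
gender_counts <- c(Female = 52699 + 28, Male = 51128 + 49)

# Percentage share of each gender, rounded to one decimal place
gender_share <- round(100 * prop.table(gender_counts), 1)
print(gender_share)
```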

Customer Age

Most customers are either in their twenties or forties. The number of passengers increases with age up to a first peak in the twenties, drops slightly in the thirties, and then climbs to its highest peak in the forties. From there, the number of passengers gradually decreases as age increases. This pattern is clearly visible in the graph and gives an overall picture of the age distribution among passengers.

Customer Type

It can be observed that 18% of the customers are disloyal and 82% are loyal. This indicates that most of the customers are regular customers of the airline company. Thus, special loyalty marketing campaigns can be strategised to retain loyal customers for sustaining sales and profits.

Travel Type

Additionally, the data shows that 69% of the customers are traveling for business purposes and 31% are traveling for personal reasons. This provides insight into the type of customers the company is serving, enabling targeted marketing and sales campaigns based on customer segmentation.

Flight Class

Lastly, regarding flight class, the data indicates that 47.8% of the customers are traveling in business class while 52.2% in economy class. This information can be used to understand the type of customers that are traveling in different classes, adjust pricing and inventory management strategies, and understand the revenue generated from different classes of customers.
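
The percentages quoted for customer type, travel type and flight class can be reproduced from the counts in the `summary(airline_df)` output. The sketch below assumes the variant spellings are merged in the same way as in the cleaning step (for example, "Bisnes" into Business, and "Eco Plus", "Economy" and "Ekon" into Eco); the list and variable names are introduced here for illustration.

```r
n_total <- 103904  # number of rows reported by dim(airline_df)

# Counts from summary(airline_df), with variant spellings merged
counts <- list(
  customer_type = c(Disloyal = 18981, Loyal = 37 + 84886),
  travel_type   = c(Business = 71622 + 33, Personal = 32176 + 73),
  class         = c(Business = 34 + 49631, Eco = 20315 + 7494 + 26388 + 42)
)

# Percentage share within each attribute, rounded to one decimal place
shares <- lapply(counts, function(x) round(100 * x / n_total, 1))
print(shares)
```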

Flight Departure & Arrival Delay in Minutes

  • Based on the box plot below, we can conclude that most departure and arrival delays are below 15 minutes.

Overall, the data gives valuable insights into the customer base that can support business decisions and improve customer satisfaction. Customer profiling allows the airline to offer personalized services to its customers, which helps to grow the business further.

3.3.1.2 Service Ratings

It is observed that check-in service, ease of online booking, gate location and in-flight Wi-Fi service are not meeting passengers’ expectations and need further improvement to satisfy customers.


Now, we will perform analysis from different perspectives, such as customer age, customer gender, customer type, travel type and flight class.

3.3.2 Customer Age

The chart above shows the count of satisfied and dissatisfied customers by age. Satisfied customers are concentrated in the 20–60 age range, whereas dissatisfied customers are spread across a wider range of 10–60 years. Older customers were in general more satisfied with all features and rated them higher than younger travelers. This could indicate that younger travelers are more critical and have higher expectations than the middle-aged and older population.

3.3.3 Customer Gender

Different perspectives from Customer Gender
From the gender perspective, overall satisfaction does not differ noticeably between male and female customers, so we take a closer look at specific services by gender.

Online.boarding across Gender

Female customers are more satisfied with the online boarding service than male customers, as most of their ratings are above 3.

Seat.comfort across Gender

Female customers are more satisfied with seat comfort than male customers, as most of their ratings are above 3.

Leg.room.service across Gender

Female passengers are more dissatisfied with leg room service than male passengers, giving it lower ratings.

Baggage.handling across Gender

Female passengers are more dissatisfied with the baggage handling service than male passengers. One possible reason is that female passengers may be more likely to carry fragile items such as cosmetic products, so careless handling would leave them unhappy with the service.

Inflight.service across Gender

Female passengers are more dissatisfied with inflight service than male passengers.

3.3.4 Customer Type

Different perspectives from Customer Type

  • Both customer types tend to choose the airline for business travel rather than personal travel, and the difference is more pronounced among disloyal customers.
  • Disloyal customers prefer Eco class for business travel, while the majority of loyal customers prefer Business class for business travel but Eco class for personal travel.

  • The distribution of age for gender in each customer type is similar.
  • The age distributions of loyal and disloyal customers differ: loyal customers have a widely spread age distribution, whereas disloyal customers have an asymmetric bimodal distribution with its larger peak in the mid-20s.
  • The quartiles of loyal customers fall between the 30s and the 50s, whereas the quartiles of disloyal customers are concentrated between the mid-20s and close to 40.
  • Most disloyal customers are younger than loyal customers.

  • No significant difference in the satisfaction level of each customer type.
  • The number of customers who are satisfied with the service provided by the airline is slightly higher in each customer type group.

Since the majority of the customers are loyal, we focus on the ratings distribution of loyal customers. Most services show a similar pattern: the rating count increases gradually from rating 1 to 4, peaks at rating 4, and decreases slightly at rating 5. However, some services do not follow this pattern, namely ease of online booking, gate location and in-flight Wi-Fi service, so the graphs below focus on the rating distributions of these three services.

  • It is observed that most customers rated 2 and 3 for ease of online booking and in-flight Wi-Fi service, and rated 3 for gate location service.
  • This indicates that further investigation is required to understand the poor performance of the online booking and gate location services, and that immediate action is needed to improve the ratings of these two services.

3.3.5 Type of Travel

Different perspectives from Type of Travel

  • Business travelers tend to have a higher level of satisfaction and loyalty when compared to personal travelers. This could be due to a number of reasons such as the more comfortable and convenient travel arrangements offered by businesses to their employees, as well as the added perks and privileges that come with business travel such as upgrades and access to exclusive lounges.
  • However, it’s also worth noting that both types of travelers seem to have similar levels of dissatisfaction, raising concerns.

  • Business travelers tend to be older and have longer flight distances than personal travelers. This could be due to a number of factors such as the nature of their work requiring more frequent and longer-distance travel, or the fact that older individuals may be more likely to hold positions that involve traveling for business.

  • Additionally, business travelers may also have a higher disposable income and be more likely to upgrade to premium seating or business class, which can result in longer flight distances.

  • Furthermore, older travelers may also prefer longer flights as they offer more comfort and amenities.

  • From the perspective of “Type of Travel”, airlines should pay more attention to four services, namely “In-flight Wi-fi service”, “Ease of Online Booking”, “Departure Arrival time convenient” and “Gate location”, as they are identified as major sources of dissatisfaction among business travelers.

  • By identifying and addressing these specific areas, Airlines can take effective measures to improve the overall travel experience for their customers, resulting in increased satisfaction and loyalty.

  • On the other hand, for personal travel, “Departure Arrival time convenient” receives notably high ratings, while the other services still have considerable room for improvement: the majority of their ratings fall between “1” and “3”, which indicates that they are not performing as well as they could.

  • By taking initiative to improve these services, the ratings could be increased from “1” to “3” and from “3” to “5”, indicating a significant increase in passenger satisfaction and loyalty.

3.3.6 Class

Different perspectives from Flight Class

  • The distribution of passengers between business and economy class is roughly equal, with a nearly equal number of male and female passengers in both classes.

  • In addition, the customer demographic in both business and economy class is similar, with the majority of passengers being Loyal Customers and a small percentage being Disloyal Customers.

  • In economy class, there is a higher proportion of passengers aged below 30 and between 60 and 70 years old, whereas in business class, there is a higher proportion of passengers aged between 30 and 60 years old. The demographic differences between economy and business class passengers could be due to various factors such as pricing, travel purpose, life stage and perception of comfort.

  • Younger passengers might choose Eco class due to cost concerns, as it is typically the most affordable option. They might also be more likely to be traveling for leisure or personal reasons rather than for work, leading them to prioritize cost savings over amenities. Alternatively, younger passengers might prioritize other aspects of their trip, such as destination or duration, and choose the Eco class in order to allocate more of their budget to those factors.

  • Business class is often chosen by adults for the enhanced comfort it provides, such as more spacious seating and additional amenities. It may also be preferred for the convenience it offers, including priority services at the airport and access to exclusive lounges. The ability to work or relax in a more comfortable setting during long flights can also be a factor in the choice of business class. Additionally, some adults may be attracted to the prestige associated with flying in this class.

  • According to the graph, a significant number of business class passengers are traveling for business purposes. While the Eco class also has a proportion of business travelers, there is a slightly higher percentage of passengers traveling for personal reasons.

  • From the graph, we can see that passengers often choose business class for longer flights. This may be because of the added comfort and amenities on long flights.

  • Passengers in the business class are generally happy with the airline’s services, but there is still room for improvement. On the other hand, many passengers in the economy classes were not satisfied or were neutral about the services they received. To make sure that all passengers have a good experience, it is important to focus on improving the services provided to the economy and economy plus classes.

  • From the graphs presented above, the areas that generate the most dissatisfaction among business class passengers are inflight wifi service, ease of online booking, gate location, and food and drink, which is similar to the outcome of the type of travel analysis. To enhance the passenger experience and improve the airline’s image and values, these areas require improvement.
  • Furthermore, for economy class, additional areas such as seat comfort, inflight entertainment, on-board service, leg room, check-in service, and cleanliness also need attention as they frequently receive low ratings and affect the overall passenger experience.

3.4 Classification

Classification is the process of predicting the target label based on features in the dataset. In this project, we build classification models using different algorithms to predict customer satisfaction (either satisfied or neutral/dissatisfied) based on the available features, such as customer demographics, flight details and service ratings. The algorithms used are:

  • Logistic Regression
  • Decision Tree
  • K-Nearest Neighbour (KNN)

3.4.1 Handle Categorical Data

Convert categorical data to numerical data using:
Ordinal Data
Convert Ordinal to Integer

airline_df_m <- airline_df
ord_ratings <- c("Inflight.wifi.service", "Food.and.drink", "Seat.comfort",
          "Departure.Arrival.time.convenient",  "Ease.of.Online.booking",
          "Gate.location", "Online.boarding", "Inflight.entertainment",
          "On.board.service",   "Leg.room.service", "Baggage.handling",
          "Checkin.service",    "Inflight.service", "Cleanliness"
          )
airline_df_m[ord_ratings] <- 
  lapply(airline_df_m[ord_ratings], 
         function(x) as.numeric(as.character(x))
         )
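
The double conversion above matters if the rating columns are stored as factors at this point (an assumption, given `stringsAsFactors = TRUE` in the import): calling `as.numeric()` directly on a factor returns the internal level codes rather than the rating labels. A minimal sketch on a made-up vector (`ratings` is hypothetical and not part of the dataset):

```r
# Hypothetical factor of ratings; its levels are sorted as "1", "3", "5"
ratings <- factor(c("3", "5", "1", "5"))

# Direct conversion returns the internal level codes, not the ratings
codes <- as.numeric(ratings)

# Converting to character first recovers the rating values themselves
values <- as.numeric(as.character(ratings))

print(codes)
print(values)
```

Here `codes` and `values` differ, which is why the conversion goes through `as.character()` first.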

Nominal Data
One-hot encoding to convert each category into a new column and assign a value as 1 or 0 to the column

# One-Hot encoding
features <- c("Gender", "Customer.Type",    "Age", "Type.of.Travel", "Class",   
              "Flight.Distance", ord_ratings, "Departure.Delay.in.Minutes",
              "Arrival.Delay.in.Minutes")
airline_df_m_trsf <- 
  cbind(data.frame(predict(dummyVars(
          ~.,
          data = airline_df_m),
          airline_df_m)
          ), 
        satisfaction = airline_df_m$satisfaction
        )
airline_df_m_trsf <- subset(airline_df_m_trsf, 
                            select = -c(satisfaction.neutral.or.dissatisfied,
                                        satisfaction.satisfied, id))

# Remove perfectly correlated variables
airline_df_m_trsf <- subset(airline_df_m_trsf, 
                            select = -c(Gender.Male, Customer.Type.Loyal.Customer, 
                                        Type.of.Travel.Personal.Travel, Class.Eco))

# Set satisfaction as factors
airline_df_m_trsf$satisfaction <- factor(airline_df_m_trsf$satisfaction)
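
To illustrate what one-hot encoding and the removal of one level per variable do, here is a small sketch using base R's `model.matrix()` on a made-up data frame. The data frame `toy` and its values are hypothetical; the report itself uses `caret::dummyVars`, so this is an equivalent illustration rather than the code that produced the results.

```r
# Hypothetical nominal data
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Female")),
  Class  = factor(c("Business", "Eco", "Business"))
)

# Full one-hot encoding: one 0/1 column per level (intercept removed)
full <- cbind(
  model.matrix(~ Gender - 1, toy),
  model.matrix(~ Class - 1, toy)
)

# For a two-level variable the two columns always sum to 1,
# so one of them carries no extra information
stopifnot(all(full[, "GenderFemale"] + full[, "GenderMale"] == 1))

# Keep a single column per two-level variable
reduced <- full[, c("GenderFemale", "ClassBusiness")]
print(reduced)
```

Dropping the redundant column avoids perfectly collinear predictors, which matters for the logistic regression fitted later.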

3.4.2 Correlation

The correlation between the different airline services and overall satisfaction is investigated further. Business class and the online boarding rating show the strongest positive correlations with overall satisfaction, each with a Pearson correlation coefficient just above 0.5. This suggests that flight class and the online boarding experience are closely associated with higher satisfaction, although correlation alone does not establish a causal link.

# Convert the satisfaction factor to its numeric level codes
# (1 = neutral or dissatisfied, 2 = satisfied)
airline_corr <- airline_df_m_trsf %>%
  mutate_if(is.factor, as.numeric)
str(airline_corr)
## 'data.frame':    103904 obs. of  23 variables:
##  $ Gender.Female                    : num  0 0 1 1 0 1 0 1 1 0 ...
##  $ Customer.Type.Disloyal.Customer  : num  0 1 0 0 0 0 0 0 0 1 ...
##  $ Age                              : num  13 25 26 25 61 26 47 52 41 20 ...
##  $ Type.of.Travel.Business.Travel   : num  0 1 1 1 1 0 0 1 1 1 ...
##  $ Class.Business                   : num  0 1 1 1 1 0 0 1 1 0 ...
##  $ Flight.Distance                  : num  460 235 1142 562 214 ...
##  $ Inflight.wifi.service            : num  3 4 2 2 3 4 2 4 1 4 ...
##  $ Departure.Arrival.time.convenient: num  4 2 2 5 3 4 4 3 2 3 ...
##  $ Ease.of.Online.booking           : num  3 3 2 5 3 2 2 4 2 3 ...
##  $ Gate.location                    : num  1 3 2 5 3 1 3 4 2 4 ...
##  $ Food.and.drink                   : num  5 1 5 2 4 1 2 3 4 2 ...
##  $ Online.boarding                  : num  3 3 5 2 5 2 2 5 3 3 ...
##  $ Seat.comfort                     : num  5 1 3 3 3 3 3 3 3 3 ...
##  $ Inflight.entertainment           : num  5 1 5 2 3 1 2 5 1 2 ...
##  $ On.board.service                 : num  4 1 4 2 3 3 3 5 1 2 ...
##  $ Leg.room.service                 : num  3 5 3 5 4 4 3 5 2 3 ...
##  $ Baggage.handling                 : num  4 3 4 3 4 4 4 5 1 4 ...
##  $ Checkin.service                  : num  4 1 4 1 3 4 3 4 4 4 ...
##  $ Inflight.service                 : num  5 4 4 4 3 4 5 5 1 3 ...
##  $ Cleanliness                      : num  5 1 5 2 3 1 2 4 2 2 ...
##  $ Departure.Delay.in.Minutes       : num  25 1 0 11 0 0 9 4 0 0 ...
##  $ Arrival.Delay.in.Minutes         : num  18 6 0 9 0 0 23 0 0 0 ...
##  $ satisfaction                     : num  1 1 2 1 2 1 1 2 1 1 ...
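# Correlation between each attribute and overall satisfaction.
# Note: airline_corr[-1] drops the first column (Gender.Female), so it is absent
# from the output, while satisfaction itself is kept (correlation of 1).
df_corr <- cor(airline_corr[-1], airline_corr$satisfaction)
colnames(df_corr)[1] <- "Satisfaction"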
df_corr
##                                    Satisfaction
## Customer.Type.Disloyal.Customer   -0.1876381714
## Age                                0.1371673050
## Type.of.Travel.Business.Travel     0.4490004498
## Class.Business                     0.5038484626
## Flight.Distance                    0.2987797858
## Inflight.wifi.service              0.2838195658
## Departure.Arrival.time.convenient -0.0516006177
## Ease.of.Online.booking             0.1717049785
## Gate.location                      0.0006820275
## Food.and.drink                     0.2097962131
## Online.boarding                    0.5035573216
## Seat.comfort                       0.3492676800
## Inflight.entertainment             0.3980594211
## On.board.service                   0.3223825215
## Leg.room.service                   0.3131308017
## Baggage.handling                   0.2477493654
## Checkin.service                    0.2361737428
## Inflight.service                   0.2447407387
## Cleanliness                        0.3051980118
## Departure.Delay.in.Minutes        -0.0504942103
## Arrival.Delay.in.Minutes          -0.0574353182
## satisfaction                       1.0000000000
#Creating correlation heatmap between different attributes.
corr_mat <- round(cor(airline_corr),2)
head(corr_mat)
##                                 Gender.Female Customer.Type.Disloyal.Customer
## Gender.Female                            1.00                            0.03
## Customer.Type.Disloyal.Customer          0.03                            1.00
## Age                                     -0.01                           -0.28
## Type.of.Travel.Business.Travel           0.01                            0.31
## Class.Business                          -0.01                           -0.09
## Flight.Distance                         -0.01                           -0.23
##                                   Age Type.of.Travel.Business.Travel
## Gender.Female                   -0.01                           0.01
## Customer.Type.Disloyal.Customer -0.28                           0.31
## Age                              1.00                           0.05
## Type.of.Travel.Business.Travel   0.05                           1.00
## Class.Business                   0.14                           0.55
## Flight.Distance                  0.10                           0.27
##                                 Class.Business Flight.Distance
## Gender.Female                            -0.01           -0.01
## Customer.Type.Disloyal.Customer          -0.09           -0.23
## Age                                       0.14            0.10
## Type.of.Travel.Business.Travel            0.55            0.27
## Class.Business                            1.00            0.47
## Flight.Distance                           0.47            1.00
##                                 Inflight.wifi.service
## Gender.Female                                   -0.01
## Customer.Type.Disloyal.Customer                 -0.01
## Age                                              0.02
## Type.of.Travel.Business.Travel                   0.10
## Class.Business                                   0.03
## Flight.Distance                                  0.01
##                                 Departure.Arrival.time.convenient
## Gender.Female                                               -0.01
## Customer.Type.Disloyal.Customer                             -0.21
## Age                                                          0.04
## Type.of.Travel.Business.Travel                              -0.26
## Class.Business                                              -0.10
## Flight.Distance                                             -0.02
##                                 Ease.of.Online.booking Gate.location
## Gender.Female                                    -0.01          0.00
## Customer.Type.Disloyal.Customer                  -0.02          0.01
## Age                                               0.02          0.00
## Type.of.Travel.Business.Travel                    0.13          0.03
## Class.Business                                    0.11          0.00
## Flight.Distance                                   0.07          0.00
##                                 Food.and.drink Online.boarding Seat.comfort
## Gender.Female                            -0.01            0.04         0.03
## Customer.Type.Disloyal.Customer          -0.06           -0.19        -0.16
## Age                                       0.02            0.21         0.16
## Type.of.Travel.Business.Travel            0.06            0.22         0.12
## Class.Business                            0.09            0.33         0.23
## Flight.Distance                           0.06            0.21         0.16
##                                 Inflight.entertainment On.board.service
## Gender.Female                                    -0.01            -0.01
## Customer.Type.Disloyal.Customer                  -0.11            -0.06
## Age                                               0.08             0.06
## Type.of.Travel.Business.Travel                    0.15             0.06
## Class.Business                                    0.20             0.22
## Flight.Distance                                   0.13             0.11
##                                 Leg.room.service Baggage.handling
## Gender.Female                              -0.03            -0.04
## Customer.Type.Disloyal.Customer            -0.05             0.02
## Age                                         0.04            -0.05
## Type.of.Travel.Business.Travel              0.14             0.03
## Class.Business                              0.21             0.17
## Flight.Distance                             0.13             0.06
##                                 Checkin.service Inflight.service Cleanliness
## Gender.Female                             -0.01            -0.04       -0.01
## Customer.Type.Disloyal.Customer           -0.03             0.02       -0.08
## Age                                        0.04            -0.05        0.05
## Type.of.Travel.Business.Travel            -0.02             0.02        0.08
## Class.Business                             0.16             0.17        0.14
## Flight.Distance                            0.07             0.06        0.09
##                                 Departure.Delay.in.Minutes
## Gender.Female                                         0.00
## Customer.Type.Disloyal.Customer                       0.00
## Age                                                  -0.01
## Type.of.Travel.Business.Travel                        0.01
## Class.Business                                       -0.01
## Flight.Distance                                       0.00
##                                 Arrival.Delay.in.Minutes satisfaction
## Gender.Female                                       0.00        -0.01
## Customer.Type.Disloyal.Customer                     0.00        -0.19
## Age                                                -0.01         0.14
## Type.of.Travel.Business.Travel                      0.01         0.45
## Class.Business                                     -0.01         0.50
## Flight.Distance                                     0.00         0.30
ggcorrplot::ggcorrplot(corr_mat, hc.order = TRUE, type = "lower",
                       lab = TRUE, lab_size = 2,
                       title = "Correlations in Airline Passenger Satisfaction Dataset",
                       legend.title = "Pearson \n Corr") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 8, hjust = 1)) +
  theme(axis.text.y = element_text(vjust = 1,
                                   size = 8, hjust = 1))

3.4.3 Train and Test Dataset Splitting

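The data is split into training and test sets with a nominal 80:20 ratio. Based on the model outputs in the following sections, the training set contains 81,316 records and the test set 22,588 records (about 78:22).

# Splitting dataset
# Note: sample.split() is given the whole data frame here, so the TRUE/FALSE
# vector it returns has one entry per column and is recycled across the rows.
split <- sample.split(airline_df_m_trsf, SplitRatio = 0.8)
train <- subset(airline_df_m_trsf, split == TRUE)
test <- subset(airline_df_m_trsf, split == FALSE)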
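
`sample.split()` expects a vector of labels. When a whole data frame is passed, it is treated as a list of columns, which appears consistent with the 81,316 training rows implied by the model outputs (18 of 23 column flags set to TRUE, about 78%). A hedged sketch of a label-based, stratified 80:20 split in base R follows; `toy_df`, `stratified_split` and the seed are assumptions for illustration only. With caTools, the corresponding call would be `sample.split(airline_df_m_trsf$satisfaction, SplitRatio = 0.8)`.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

# Hypothetical data with a 60/40 class balance
toy_df <- data.frame(
  x = rnorm(1000),
  satisfaction = factor(rep(c("neutral or dissatisfied", "satisfied"),
                            times = c(600, 400)))
)

# Return a logical vector (TRUE = training row) that samples
# the requested share of rows within each class
stratified_split <- function(labels, ratio = 0.8) {
  in_train <- logical(length(labels))
  for (lev in levels(labels)) {
    idx <- which(labels == lev)
    chosen <- idx[sample.int(length(idx), round(ratio * length(idx)))]
    in_train[chosen] <- TRUE
  }
  in_train
}

in_train  <- stratified_split(toy_df$satisfaction, ratio = 0.8)
toy_train <- toy_df[in_train, ]
toy_test  <- toy_df[!in_train, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(prop.table(table(toy_train$satisfaction)))
```

Sampling within each class keeps the class proportions of the training and test sets close to those of the full data.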

We will build classification models using several classifiers, namely Logistic Regression, Decision Tree and K-Nearest Neighbors (KNN), and then evaluate, discuss and compare the results accordingly.

3.4.4 Logistic Regression

Logistic Regression is a classification algorithm that models the probability of a binary target variable (dependent variable) as a function of the independent variables. It relies on a few important assumptions: an appropriate (binary) outcome structure, independence of observations, little or no multicollinearity among the independent variables, linearity between the independent variables and the log odds, and a large sample size.

Fit logistic regression model using the glm (generalized linear model) function
Model Summary:

logit_model <- glm(satisfaction ~.,
                   family = binomial(link='logit'),
                   data = train)
summary(logit_model)
## 
## Call:
## glm(formula = satisfaction ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8420  -0.4923  -0.1760   0.3862   4.0048  
## 
## Coefficients:
##                                     Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                       -9.239e+00  8.829e-02 -104.633  < 2e-16 ***
## Gender.Female                     -5.853e-02  2.199e-02   -2.662 0.007777 ** 
## Customer.Type.Disloyal.Customer   -2.050e+00  3.358e-02  -61.058  < 2e-16 ***
## Age                               -8.644e-03  7.992e-04  -10.815  < 2e-16 ***
## Type.of.Travel.Business.Travel     2.713e+00  3.523e-02   77.012  < 2e-16 ***
## Class.Business                     7.466e-01  2.799e-02   26.673  < 2e-16 ***
## Flight.Distance                   -2.356e-05  1.274e-05   -1.849 0.064418 .  
## Inflight.wifi.service              3.930e-01  1.293e-02   30.399  < 2e-16 ***
## Departure.Arrival.time.convenient -1.272e-01  9.284e-03  -13.700  < 2e-16 ***
## Ease.of.Online.booking            -1.391e-01  1.277e-02  -10.887  < 2e-16 ***
## Gate.location                      2.689e-02  1.037e-02    2.594 0.009491 ** 
## Food.and.drink                    -2.479e-02  1.205e-02   -2.058 0.039609 *  
## Online.boarding                    6.162e-01  1.159e-02   53.180  < 2e-16 ***
## Seat.comfort                       7.235e-02  1.262e-02    5.732 9.94e-09 ***
## Inflight.entertainment             6.250e-02  1.615e-02    3.869 0.000109 ***
## On.board.service                   3.041e-01  1.149e-02   26.458  < 2e-16 ***
## Leg.room.service                   2.509e-01  9.636e-03   26.043  < 2e-16 ***
## Baggage.handling                   1.402e-01  1.292e-02   10.856  < 2e-16 ***
## Checkin.service                    3.172e-01  9.653e-03   32.864  < 2e-16 ***
## Inflight.service                   1.158e-01  1.359e-02    8.520  < 2e-16 ***
## Cleanliness                        2.225e-01  1.364e-02   16.316  < 2e-16 ***
## Departure.Delay.in.Minutes         4.484e-03  1.038e-03    4.321 1.55e-05 ***
## Arrival.Delay.in.Minutes          -9.090e-03  1.028e-03   -8.840  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111243  on 81315  degrees of freedom
## Residual deviance:  54330  on 81293  degrees of freedom
## AIC: 54376
## 
## Number of Fisher Scoring iterations: 6
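  • Most of the predictors are statistically significant for customer satisfaction (p < 0.05); the only exception is Flight Distance (p ≈ 0.064). Gender is significant (p ≈ 0.008), although its estimated effect is comparatively small.
  • The residual deviance (54,330) is far lower than the null deviance (111,243), which indicates that the predictors explain a substantial share of the variation in satisfaction compared with an intercept-only model.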
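
To make the coefficient table easier to read, the estimates can be exponentiated into odds ratios, and the two deviances can be combined into McFadden's pseudo R-squared, 1 - (residual deviance / null deviance). The sketch below copies a few values from the summary above; the variable names are illustrative. On the fitted object, the same quantities would come from `exp(coef(logit_model))`, `logit_model$deviance` and `logit_model$null.deviance`.

```r
# Selected estimates copied from the model summary above
estimates <- c(Online.boarding                 = 0.6162,
               Type.of.Travel.Business.Travel  = 2.713,
               Customer.Type.Disloyal.Customer = -2.050)

# Odds ratio: multiplicative change in the odds of "satisfied" for a
# one-unit increase in the predictor, with the other predictors held fixed
odds_ratios <- exp(estimates)
print(round(odds_ratios, 2))

# McFadden's pseudo R-squared from the reported deviances
null_deviance     <- 111243
residual_deviance <- 54330
pseudo_r2 <- 1 - residual_deviance / null_deviance
print(round(pseudo_r2, 3))
```

For example, each additional point on the Online boarding rating multiplies the odds of being satisfied by roughly 1.85, and the pseudo R-squared is about 0.51.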

Model Prediction & Evaluation
Model evaluation using confusion matrix.

# Predict test data
predict_reg <- predict(logit_model, 
                       subset(test, 
                              select = -c(satisfaction)
                              ),
                       type = "response")
predict_reg <- ifelse(predict_reg >0.5, "satisfied", "neutral or dissatisfied")

# Confusion matrix
confusionMatrix(factor(predict_reg), test$satisfaction)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                neutral or dissatisfied satisfied
##   neutral or dissatisfied                   11514      1589
##   satisfied                                  1222      8263
##                                                  
##                Accuracy : 0.8756                 
##                  95% CI : (0.8712, 0.8798)       
##     No Information Rate : 0.5638                 
##     P-Value [Acc > NIR] : < 2.2e-16              
##                                                  
##                   Kappa : 0.7459                 
##                                                  
##  Mcnemar's Test P-Value : 5.084e-12              
##                                                  
##             Sensitivity : 0.9041                 
##             Specificity : 0.8387                 
##          Pos Pred Value : 0.8787                 
##          Neg Pred Value : 0.8712                 
##              Prevalence : 0.5638                 
##          Detection Rate : 0.5097                 
##    Detection Prevalence : 0.5801                 
##       Balanced Accuracy : 0.8714                 
##                                                  
##        'Positive' Class : neutral or dissatisfied
## 
  • Based on the confusion matrix, the accuracy of the model is roughly 88%, which is much higher than the no information rate of 56% (the proportion of the majority class in the test data).
  • The model also shows high sensitivity (0.90) and specificity (0.84). With “neutral or dissatisfied” as the positive class, about 10% of neutral or dissatisfied passengers and about 16% of satisfied passengers are misclassified.
  • The Kappa statistic is around 0.75, indicating substantial agreement between the predicted and actual labels beyond what would be expected by chance.
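
The headline statistics can be reproduced by hand from the four counts in the confusion matrix, which helps clarify what each one measures. The sketch below re-enters the counts from the output above; the object names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class; filled column by column)
cm <- matrix(c(11514, 1222,    # actual: neutral or dissatisfied
               1589,  8263),   # actual: satisfied
             nrow = 2,
             dimnames = list(
               Prediction = c("neutral or dissatisfied", "satisfied"),
               Reference  = c("neutral or dissatisfied", "satisfied")))

n <- sum(cm)

# Share of all test rows that were predicted correctly
accuracy <- sum(diag(cm)) / n

# Positive class is "neutral or dissatisfied"
sensitivity <- cm[1, 1] / sum(cm[, 1])  # positives correctly identified
specificity <- cm[2, 2] / sum(cm[, 2])  # negatives correctly identified

# Cohen's kappa: accuracy corrected for agreement expected by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa), 4))
```

These values agree with the Accuracy, Sensitivity, Specificity and Kappa lines reported by `confusionMatrix()`.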

3.4.5 Decision Tree

Decision tree is a non-parametric supervised learning algorithm for classification and prediction. It is structured as a hierarchical flowchart-like tree where:

  • each internal node represents a test on an attribute
  • each branch represents a test outcome
  • each leaf node (terminal node) holds a class label

The splitting of the source set into subsets is repeated on each derived subset, and the recursion is completed when splitting no longer adds value to the prediction. A Decision Tree model can be fitted with default settings, without manual parameter tuning, and provides a clear indication of which attributes are most important for prediction.

Build the decision tree model using ctree.

dt_model <- ctree(satisfaction ~ ., train)
# plot(dt_model)

Model Prediction & Evaluation
Model evaluation using confusion matrix.

# Predict test data
predict_dt <- predict(dt_model, test)

# Confusion matrix
dt_m <- table(test$satisfaction, predict_dt)
dt_m
##                          predict_dt
##                           neutral or dissatisfied satisfied
##   neutral or dissatisfied                   12285       451
##   satisfied                                   730      9122
dt_ac <- sum(diag(dt_m)) / sum(dt_m)
print(paste('Accuracy =', dt_ac))
## [1] "Accuracy = 0.947715601204179"

The accuracy score for the Decision Tree model is around 95%, which is relatively high. From the confusion matrix, the model correctly predicted 9,122 passengers as satisfied and 12,285 passengers as neutral or dissatisfied. It misclassified 451 neutral or dissatisfied passengers as satisfied and 730 satisfied passengers as neutral or dissatisfied.

3.4.6 K-Nearest Neighbour (KNN)

The K-Nearest Neighbors (KNN) algorithm is a simple, yet powerful, supervised machine learning algorithm. It is used for both classification and regression problems. In classification, the idea is to find the k-number of nearest data points in the training set for a new data point, and then predict the class of the new data point based on the majority class among its k-nearest neighbors.
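
The majority-vote idea can be sketched in a few lines of base R. This is an illustration on made-up points, not the implementation used in the report (which relies on `class::knn`); `train_x`, `train_y` and `knn_predict_one` are hypothetical names.

```r
# Hypothetical training points (two features) and their classes
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("neutral or dissatisfied", "neutral or dissatisfied",
                    "neutral or dissatisfied",
                    "satisfied", "satisfied", "satisfied"))

knn_predict_one <- function(train_x, train_y, new_point, k) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Majority class among those neighbours
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

pred_far  <- knn_predict_one(train_x, train_y, new_point = c(5, 5), k = 3)
pred_near <- knn_predict_one(train_x, train_y, new_point = c(1.5, 1.5), k = 3)
print(c(pred_far, pred_near))
```

Because the prediction depends directly on distances, features measured on larger scales would dominate unless they are rescaled first.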

Fit the KNN model

  • The features are scaled to ensure that they are all on the same scale, so that no single feature dominates the distance measure.
  • The square root of the number of training observations is a common rule of thumb for an initial k value, as it aims to balance the trade-off between overfitting and underfitting. Overfitting occurs when a model is too complex and fits the noise in the training data, while underfitting occurs when a model is too simple and does not fit the training data well.
  • After running the code, the square root is approximately 285.16, so k = 285 and k = 286 are chosen for the initial model predictions.
# Feature scaling
# Normalize the range of independent variables or features of data
train_scale <- scale(train[, 1:22])
test_scale <- scale(test[, 1:22])

# Determine the optimal k value using square root of total observations
# sqrt(NROW(train))
# 285.1596

# KNN (try 285, 286)
knn_model_285 <- knn(train = train_scale,
                     test = test_scale,
                     cl = train$satisfaction,
                     k = 285)
knn_model_286 <- knn(train = train_scale,
                     test = test_scale,
                     cl = train$satisfaction,
                     k = 286)
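
Note that the code above standardises the test set with its own means and standard deviations. A common alternative, sketched here on made-up matrices (`train_mat` and `test_mat` are hypothetical), is to reuse the centre and scale learned from the training set so that both sets are transformed identically.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible

# Hypothetical numeric feature matrices
train_mat <- matrix(rnorm(200, mean = 50, sd = 10), ncol = 2)
test_mat  <- matrix(rnorm(40,  mean = 50, sd = 10), ncol = 2)

# Standardise the training set; scale() stores the parameters as attributes
train_scaled <- scale(train_mat)

# Apply the training centre and scale to the test set
test_scaled <- scale(test_mat,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

print(round(colMeans(train_scaled), 10))  # zero by construction
print(colMeans(test_scaled))              # close to, but not exactly, zero
```

Whether this changes the reported accuracies noticeably is not something the report tests; with a large test set the two approaches often give similar results.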

Model Prediction & Evaluation
Model evaluation using confusion matrix

# Confusion Matrix
confusionMatrix(knn_model_285, test$satisfaction)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                neutral or dissatisfied satisfied
##   neutral or dissatisfied                   12184      1587
##   satisfied                                   552      8265
##                                                  
##                Accuracy : 0.9053                 
##                  95% CI : (0.9014, 0.9091)       
##     No Information Rate : 0.5638                 
##     P-Value [Acc > NIR] : < 2.2e-16              
##                                                  
##                   Kappa : 0.8052                 
##                                                  
##  Mcnemar's Test P-Value : < 2.2e-16              
##                                                  
##             Sensitivity : 0.9567                 
##             Specificity : 0.8389                 
##          Pos Pred Value : 0.8848                 
##          Neg Pred Value : 0.9374                 
##              Prevalence : 0.5638                 
##          Detection Rate : 0.5394                 
##    Detection Prevalence : 0.6097                 
##       Balanced Accuracy : 0.8978                 
##                                                  
##        'Positive' Class : neutral or dissatisfied
## 
confusionMatrix(knn_model_286, test$satisfaction)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                neutral or dissatisfied satisfied
##   neutral or dissatisfied                   12183      1588
##   satisfied                                   553      8264
##                                                  
##                Accuracy : 0.9052                 
##                  95% CI : (0.9013, 0.909)        
##     No Information Rate : 0.5638                 
##     P-Value [Acc > NIR] : < 2.2e-16              
##                                                  
##                   Kappa : 0.805                  
##                                                  
##  Mcnemar's Test P-Value : < 2.2e-16              
##                                                  
##             Sensitivity : 0.9566                 
##             Specificity : 0.8388                 
##          Pos Pred Value : 0.8847                 
##          Neg Pred Value : 0.9373                 
##              Prevalence : 0.5638                 
##          Detection Rate : 0.5394                 
##    Detection Prevalence : 0.6097                 
##       Balanced Accuracy : 0.8977                 
##                                                  
##        'Positive' Class : neutral or dissatisfied
## 
  • The results show that the accuracy of the KNN model is about 90.5% for both k = 285 and k = 286 (0.9053 and 0.9052 respectively).

  • The accuracy is high compared with the no information rate of 56%, which means that the model makes correct predictions for a large share of the test data.

  • The confusion matrices indicate that the KNN models with k = 285 and k = 286 both have good sensitivity (about 0.96) and specificity (about 0.84).

  • The model correctly predicts the majority of the test data points. It performs marginally better with k = 285, but the difference is very small.

  • In order to further improve the KNN model, we proceed to find the optimal k value.

# Improve model and find optimal k value
# i = 1
# k.optm = 1
# for (i in 1:25){ 
#     knn <-  knn(train = train_scale, 
#                 test = test_scale, 
#                 cl = train$satisfaction, 
#                 k = i)
#     k.optm[i] <- 100 * sum(test$satisfaction == knn)/NROW(test)
#     k = i
#     cat(k, '=', k.optm[i], '\n')
# }
# 1 = 91.39795 
# 2 = 91.00065 
# 3 = 92.69611 
# 4 = 92.39582 
# 5 = 92.86242 
# 6 = 92.72845 
# 7 = 92.99185 
# 8 = 92.73769 
# 9 = 92.83932 
# 10 = 92.8578 
# 11 = 92.89014 
# 12 = 92.8116 
# 13 = 92.91324 
# 14 = 92.86704 
# 15 = 92.71921 
# 16 = 92.64529 
# 17 = 92.64529 
# 18 = 92.56214 
# 19 = 92.64529 
# 20 = 92.66839 
# 21 = 92.58062 
# 22 = 92.4836 
# 23 = 92.51132 
# 24 = 92.43278 
# 25 = 92.46974 
# plot(k.optm, type="b", xlab="K- Value", ylab="Accuracy level")

# Optimal k: 7

# KNN (use 7)
knn_model_7 <- knn(train = train_scale,
                    test = test_scale,
                    cl = train$satisfaction,
                    k = 7)

# Confusion Matrix
confusionMatrix(knn_model_7, test$satisfaction)
## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                neutral or dissatisfied satisfied
##   neutral or dissatisfied                   12316      1168
##   satisfied                                   420      8684
##                                                  
##                Accuracy : 0.9297                 
##                  95% CI : (0.9263, 0.933)        
##     No Information Rate : 0.5638                 
##     P-Value [Acc > NIR] : < 2.2e-16              
##                                                  
##                   Kappa : 0.8558                 
##                                                  
##  Mcnemar's Test P-Value : < 2.2e-16              
##                                                  
##             Sensitivity : 0.9670                 
##             Specificity : 0.8814                 
##          Pos Pred Value : 0.9134                 
##          Neg Pred Value : 0.9539                 
##              Prevalence : 0.5638                 
##          Detection Rate : 0.5452                 
##    Detection Prevalence : 0.5970                 
##       Balanced Accuracy : 0.9242                 
##                                                  
##        'Positive' Class : neutral or dissatisfied
## 
  • From the output, the highest accuracy among k = 1 to 25 is about 93%, obtained at k = 7; that is, the model correctly predicted the class for 93% of the test samples. Therefore, k = 7 is selected as the optimal value for this model.
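
Because the loop above picks k by scoring each candidate on the test set, the reported 93% is likely to be a slightly optimistic estimate. One hedged alternative is to choose k using only the training data, for instance with leave-one-out cross-validation via `class::knn.cv()`. The sketch uses the built-in `iris` data as a stand-in, and the candidate grid `ks` is an assumption.

```r
library(class)

set.seed(123)  # knn breaks ties at random, so fix a seed for the sketch

# Stand-in data: scaled features and class labels
features <- scale(iris[, 1:4])
labels   <- iris$Species

ks <- seq(1, 25, by = 2)  # assumed grid of odd k values

# Leave-one-out accuracy for each k, computed on the training data only
cv_accuracy <- sapply(ks, function(k) {
  mean(knn.cv(train = features, cl = labels, k = k) == labels)
})

best_k <- ks[which.max(cv_accuracy)]
print(data.frame(k = ks, accuracy = round(cv_accuracy, 3)))
print(best_k)
```

Leave-one-out cross-validation would be slow on roughly 81,000 training rows, so a held-out validation subset or k-fold cross-validation may be more practical for the airline data.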

4 Results

  • From the outcome, the Decision Tree outperformed the rest with an accuracy of 95%, compared with 88% for Logistic Regression and 93% for KNN at its best k value (k = 7; about 91% at k = 285).
  • Therefore, the airline company should opt for the Decision Tree model for predictive analysis of passenger satisfaction.

5 Conclusion

  • Overall, it is suggested that airlines give more attention to four services, namely “In-flight Wi-Fi service,” “Ease of Online Booking,” “Departure Arrival time convenient,” and “Gate location”, as they are identified as major sources of dissatisfaction among business travelers.

  • For Personal Travel, the majority of the ratings for these services fall between “1” and “3” which indicates that these services are not performing as well as they could. By taking initiative to improve these services, the ratings could be increased, indicating a significant increase in passenger satisfaction and loyalty.

6 Contribution

YuXuan Lim:

  • Unclean the ready dataset for data cleaning
  • EDA for Gender and Travel Class Attributes
  • Correlation
  • Decision Tree
  • Modelling Result comparison
  • Presentation slide and recording

Li Xin Qi:

  • Data checking and cleaning
  • EDA for Overview and Customer Type
  • Model building for logistic regression and K-Nearest Neighbour (KNN) algorithm
  • Presentation
